Incremental Learning for Heterogeneous Structure Segmentation in Brain Tumor MRI

Deep learning (DL) models for segmenting various anatomical structures have achieved great success as static models trained in a single source domain. Yet a static DL model is likely to perform poorly in a continually evolving environment and therefore requires appropriate model updates. In an incremental learning setting, we would expect a well-trained static model to be updated with continually evolving target domain data (e.g., additional lesions or structures of interest) collected from different sites, without catastrophic forgetting. This, however, is challenging because of distribution shifts, additional structures not seen during the initial model training, and the absence of training data from the source domain. To address these challenges, we seek to progressively evolve an “off-the-shelf” trained segmentation model to diverse datasets with additional anatomical categories in a unified manner. Specifically, we first propose a divergence-aware dual-flow module with balanced rigidity and plasticity branches to decouple old and new tasks, which is guided by continuous batch renormalization. Then, a complementary pseudo-label training scheme with self-entropy regularized momentum MixUp decay is developed for adaptive network optimization. We evaluated our framework on a brain tumor segmentation task with continually changing target domains, i.e., new MRI scanners/modalities with incremental structures. Our framework retained the discriminability of previously learned structures well, hence enabling realistic life-long extension of segmentation models along with the widespread accumulation of big medical data.


Introduction
Accurate segmentation of a variety of anatomical structures is a crucial prerequisite for subsequent diagnosis or treatment [28]. While recent advances in data-driven deep learning (DL) have achieved superior segmentation performance [29], the segmentation task is often constrained by the availability of costly pixel-wise labeled training datasets. In addition, even if static DL models are trained with extraordinarily large amounts of training data in a supervised learning manner [29], there exists a need for a segmentor to update a trained model with new data alongside incremental anatomical structures [24].
In real-world scenarios, clinical databases are often sequentially constructed from various clinical sites with varying imaging protocols [19,20,21,23]. In addition, the set of labeled anatomical structures grows incrementally with additional lesions or new structures of interest, depending on study goals or clinical needs [27,18]. Furthermore, access to previously used training data can be restricted due to data privacy protocols [18,17]. Therefore, efficient heterogeneous structure-incremental (HSI) learning is highly desired in clinical practice to develop a DL model that generalizes well to different types of input data and varying structures. Unfortunately, straightforwardly fine-tuning DL models with either new structures [30] or heterogeneous data [17], in the absence of the data used for the initial model training, can easily overwrite previously learned knowledge, i.e., catastrophic forgetting [30,17,14].
At present, satisfactory methods for the realistic HSI setting are largely unavailable. First, recent structure-incremental works cannot deal with domain shift. Early attempts [27] simply used exemplar data from the previous stage. Other works [5,33,30,18] combined the prediction of a trained model with a new class mask as a pseudo-label. However, predictions from the old model under a domain shift are likely to be unreliable [38]. The widely used pooled feature statistics consistency [5,30] is also not applicable to heterogeneous data, since the statistics are domain-specific [2]. In addition, a few works [13,25,34] proposed to increase the capacity of networks to avoid directly overwriting parameters that are entangled with old and new knowledge. However, these solutions are not domain adaptive. Second, from the perspective of continuous domain adaptation with a consistent class label, old exemplars have been used for prostate MRI segmentation [32]. While Li et al. [17] further proposed to recover the missing old-stage data with an additional generative model, hallucinating realistic data given only the trained model itself is a highly challenging task [31] and may lead to sensitive information leakage [35]. Third, while Kundu et al. [16] updated the model for class-incremental unsupervised domain adaptation in natural image classification, their class prototype is not applicable to segmentation.
In this work, we propose a unified HSI segmentor evolving framework with a divergence-aware decoupled dual-flow (D$^3$F) module, which is adaptively optimized via HSI pseudo-label distillation using a momentum MixUp decay (MMD) scheme. To explicitly avoid overwriting previously learned parameters, our D$^3$F follows a "divide-and-conquer" strategy to balance the old and new tasks with a fixed rigidity branch and a compensated learnable plasticity branch, which is guided by our novel divergence-aware continuous batch renormalization (cBRN). The complementary knowledge can be flexibly integrated with the model re-parameterization [4]. The number of additional parameters is constant during training and zero at test time. Then, the flexible D$^3$F module is trained following knowledge distillation with novel HSI pseudo-labels. Specifically, inspired by self-knowledge distillation [15] and self-training [38], which utilize the previous prediction for better generalization, we adaptively construct the HSI pseudo-label with an MMD scheme to smoothly adjust the contributions of the potentially noisy old model predictions on heterogeneous data and the progressively learned new model predictions along with the training. In addition, unsupervised self-entropy minimization is added to further enhance performance.
Our main contributions can be summarized as follows:
• To our knowledge, this is the first attempt at realistic HSI segmentation with both incremental structures of interest and diverse domains.
• We propose a divergence-aware decoupled dual-flow module guided by our novel continuous batch renormalization (cBRN) for alleviating the catastrophic forgetting under domain shift scenarios.
• The adaptively constructed HSI pseudo-label with self-training is developed for efficient HSI knowledge distillation.
We evaluated our framework on anatomical structure segmentation tasks with different types of MRI data collected from multiple sites. Our HSI scheme demonstrated superior performance in segmenting all structures under diverse data distributions, surpassing by a large margin conventional class-incremental methods that do not consider data shift.

Methodology
For the segmentation model under incremental structures of interest and domain shift scenarios, we are given an off-the-shelf segmentor $f_{\theta^0}: X^0 \rightarrow Y^0$ parameterized with $\theta^0$, which has been trained with the data $\{x^0_n, y^0_n\}_{n=1}^{N^0}$ in an initial source domain $D^0 = \{X^0, Y^0\}$, where $x^0_n \in \mathbb{R}^{H \times W}$ and $y^0_n \in \mathbb{R}^{H \times W}$ are the paired image slice and its segmentation mask with height $H$ and width $W$, respectively. There are $T$ consecutive evolving stages with heterogeneous target domains $D^t = \{X^t, S^t\}_{t=1}^{T}$, each with the paired slice set $\{x^t_n\}_{n=1}^{N^t} \in X^t$ and the current stage label set $\{s^t_n\}_{n=1}^{N^t} \in S^t$, where $x^t_n, s^t_n \in \mathbb{R}^{H \times W}$. Due to heterogeneous domain shifts, $X^t$ from different sites or modalities follows diverse distributions across all $T$ stages. Due to incremental anatomical structures, the overall label space across the previous $t$ stages, $Y^t$, is expanded from $Y^{t-1}$ with the additionally annotated structures, so that all of the structures $Y^T$ seen in the $T$ stages are eventually delineated.

cBRN guided divergence-aware decoupled dual-flow
To alleviate the forgetting through parameter overwriting, caused by both new structures and data shift, we propose a D$^3$F module for flexible decoupling and integration of old and new knowledge. Specifically, we duplicate the convolution in each layer, initialized with the previous model $f_{\theta^{t-1}}$, to form two branches as in [13,25,34]. The first, the rigidity branch $f^r_{\theta^t}$, is fixed at stage $t$ to keep the old knowledge we have learned. In contrast, the extended plasticity branch $f^p_{\theta^t}$ is expected to be adaptively updated to learn the new task in $D^t$. At the end of the current training stage $t$, we can flexibly integrate the convolutions in the two branches, i.e., $\{W^r_t, b^r_t\}$ and $\{W^p_t, b^p_t\}$, by averaging them into $\{\frac{W^r_t + W^p_t}{2}, \frac{b^r_t + b^p_t}{2}\}$ with the model re-parameterization [4]. In fact, the dual-flow model can be regarded as an implicit ensemble scheme [9] that integrates multiple sub-modules with different focuses. In addition, as demonstrated in [6], the fixed modules regularize the learnable modules to act like the fixed ones. Thus, the plasticity modules are also implicitly encouraged to keep the previous knowledge along with their HSI learning.
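The integration step relies on the linearity of convolution: averaging the outputs of two parallel convolutions is equivalent to a single convolution with averaged weights and biases. The following is a minimal NumPy sketch of that identity, not the authors' implementation; the helper names `conv2d_valid` and `merge_branches`, the single-channel setting, and the random test data are assumptions made for illustration.

```python
import numpy as np

def conv2d_valid(x, w, b):
    """Naive single-channel 2D cross-correlation with bias ('valid' padding)."""
    kh, kw = w.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w) + b
    return out

def merge_branches(w_r, b_r, w_p, b_p):
    """Average the rigidity and plasticity parameters into one convolution."""
    return 0.5 * (w_r + w_p), 0.5 * (b_r + b_p)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8))                                # hypothetical input feature map
w_r, w_p = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))
b_r, b_p = 0.1, -0.3

# Output of the two-branch layer (average of both branch outputs) ...
two_branch = 0.5 * (conv2d_valid(x, w_r, b_r) + conv2d_valid(x, w_p, b_p))
# ... matches the output of the single merged convolution.
w_m, b_m = merge_branches(w_r, b_r, w_p, b_p)
merged = conv2d_valid(x, w_m, b_m)
print(np.allclose(two_branch, merged))
```

Because the merged layer has the same shape as one original convolution, the second branch adds no parameters once a stage has been merged.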
However, under domain shift, it can be sub-optimal to directly average the parameters, since $f^r_{\theta^t}$ may not perform well in predicting $Y^{t-1}$ on $X^t$. It has been demonstrated that batch statistics adaptation plays an important role in domain generalizable model training [22]. Therefore, we propose a continuous batch renormalization (cBRN) to mitigate the feature statistics divergence between each training batch at a specific stage and the life-long global data distribution.
Of note, as a default block in modern convolutional neural networks (CNNs) [8,37], batch normalization (BN) [11] normalizes the input feature of each CNN channel $z \in \mathbb{R}^{H_c \times W_c}$ with its batch-wise statistics, e.g., mean $\mu_B$ and standard deviation $\sigma_B$, and learnable scaling and shifting factors $\{\gamma, \beta\}$ as $\tilde{z}_i = \frac{z_i - \mu_B}{\sigma_B} \cdot \gamma + \beta$, where $i$ indexes the spatial position in $\mathbb{R}^{H_c \times W_c}$. BN assumes the same mini-batch distribution in training and testing [10], which does not hold in HSI. Simply enforcing the same statistics across domains as in [5,33,30] can weaken the model expressiveness [36]. The recent BRN [10] proposes to rectify the data shift between each batch and the dataset by using the moving averages $\mu$ and $\sigma$ along with the training:
$$\mu = (1 - \eta) \cdot \mu + \eta \cdot \mu_B, \quad \sigma = (1 - \eta) \cdot \sigma + \eta \cdot \sigma_B,$$
where $\eta \in [0, 1]$ is applied to balance the global statistics and the current batch. In addition, $\gamma = \frac{\sigma_B}{\sigma}$ and $\beta = \frac{\mu_B - \mu}{\sigma}$ are used in both training and testing. Substituting these factors into the BN formula gives $\frac{z_i - \mu_B}{\sigma_B} \cdot \frac{\sigma_B}{\sigma} + \frac{\mu_B - \mu}{\sigma} = \frac{z_i - \mu}{\sigma}$. Therefore, BRN renormalizes $\tilde{z}_i = \frac{z_i - \mu}{\sigma}$ to highlight the dependency on the global statistics $\{\mu, \sigma\}$ in training for a more generalizable model, while being limited to static learning.
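The renormalization described above can be checked numerically. The sketch below is a minimal NumPy illustration for a single feature channel, assuming the moving-average form with momentum $\eta$; the function names `brn_update` and `brn_normalize` and the example statistics are hypothetical and do not come from the paper or from any library.

```python
import numpy as np

def brn_update(mu, sigma, mu_b, sigma_b, eta=0.01):
    """Moving-average update of the global statistics; eta weights the current batch."""
    return (1 - eta) * mu + eta * mu_b, (1 - eta) * sigma + eta * sigma_b

def brn_normalize(z, mu, sigma):
    """Normalize with batch statistics, then correct toward the global statistics."""
    mu_b, sigma_b = z.mean(), z.std()
    scale = sigma_b / sigma        # plays the role of gamma in the text
    shift = (mu_b - mu) / sigma    # plays the role of beta in the text
    return (z - mu_b) / sigma_b * scale + shift

rng = np.random.default_rng(0)
z = rng.normal(loc=2.0, scale=3.0, size=(16, 16))   # one hypothetical feature channel
mu, sigma = 0.5, 2.0                                 # hypothetical global statistics

# The batch-wise normalization plus correction collapses to (z - mu) / sigma.
out = brn_normalize(z, mu, sigma)
print(np.allclose(out, (z - mu) / sigma))

# After the forward pass, the global statistics drift slowly toward the batch.
mu, sigma = brn_update(mu, sigma, z.mean(), z.std())
```

A small $\eta$ keeps the global statistics stable against any single batch, which matches the value of 0.01 used in the experiments.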
In this work, we further explore the potential of BRN in the continuously evolving HSI task to generalize to all of the domains involved. Specifically, we extend BRN to cBRN across multiple consecutive stages by updating $\{\mu_c, \sigma_c\}$ along with all stages of training, which are transferred across stages as shown in Fig. 1. The conventional BN also inherits $\{\mu, \sigma\}$ for testing, although they are not used in training [11]. At stage $t$, $\mu_c$ and $\sigma_c$ are inherited from stage $t-1$ and are updated with the current batch-wise statistics $\{\mu^r_B, \sigma^r_B\}$ and $\{\mu^p_B, \sigma^p_B\}$ of the rigidity and plasticity branches. For testing, the two branches in the final model $f_{\theta^T}$ can be merged for a lightweight implementation. Therefore, $f_{\theta^T}$ does not introduce additional parameters for deployment.
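To make the life-long statistics concrete, the following sketch carries $\{\mu_c, \sigma_c\}$ through several stages without resetting them. It is an illustration under stated assumptions rather than the paper's exact update rule: the batch statistics of the two branches are simply averaged before a BRN-style moving-average update, and the class name `ContinuousStats` and the synthetic stage data are hypothetical.

```python
import numpy as np

class ContinuousStats:
    """Running statistics {mu_c, sigma_c} carried across all stages (never reset)."""

    def __init__(self, mu_c=0.0, sigma_c=1.0, eta=0.01):
        self.mu_c, self.sigma_c, self.eta = mu_c, sigma_c, eta

    def update(self, z_r, z_p):
        # Assumption: batch statistics of the rigidity (z_r) and plasticity (z_p)
        # branches are averaged before the moving-average update.
        mu_b = 0.5 * (z_r.mean() + z_p.mean())
        sigma_b = 0.5 * (z_r.std() + z_p.std())
        self.mu_c = (1 - self.eta) * self.mu_c + self.eta * mu_b
        self.sigma_c = (1 - self.eta) * self.sigma_c + self.eta * sigma_b

    def normalize(self, z):
        """Normalize a feature channel with the life-long statistics."""
        return (z - self.mu_c) / self.sigma_c

stats = ContinuousStats(eta=0.01)
rng = np.random.default_rng(0)
for stage_mean in (0.0, 3.0, -2.0):          # three hypothetical stages with shifted data
    for _ in range(200):                      # training batches within one stage
        z_r = rng.normal(stage_mean, 1.0, size=(8, 8))
        z_p = rng.normal(stage_mean, 1.5, size=(8, 8))
        stats.update(z_r, z_p)
    print(stage_mean, float(stats.mu_c), float(stats.sigma_c))
```

Since the statistics object persists between stages, features at stage $t$ are normalized with values that still reflect earlier domains instead of only the current one.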

HSI pseudo-label distillation with momentum MixUp decay
The training of our developed $f_{\theta^t}$ with D$^3$F is supervised with the previous model $f_{\theta^{t-1}}$ and the current stage data $\{x^t_n, s^t_n\}_{n=1}^{N^t}$. In conventional class-incremental learning, knowledge distillation [31] is widely used to construct the combined label $y^t_n \in \mathbb{R}^{H \times W}$ by adding $s^t_n$ and the prediction $f_{\theta^{t-1}}(x^t_n)$. Then, $f_{\theta^t}$ can be optimized with the training pairs $\{x^t_n, y^t_n\}_{n=1}^{N^t}$. However, with heterogeneous data in different stages, $f_{\theta^{t-1}}(x^t_n)$ can be highly unreliable. Simply using it as ground truth cannot guide correct knowledge transfer.
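For reference, the conventional combined label can be sketched as overlaying the new annotation on the hard prediction of the old model. This is a minimal illustration with integer class maps; the function name `combine_labels`, the class indices, and the rule that the annotation overrides the old prediction where they overlap are assumptions.

```python
import numpy as np

def combine_labels(old_pred, new_mask, new_class_id):
    """Overlay the annotated new structure on the old model's hard prediction."""
    combined = old_pred.copy()
    combined[new_mask > 0] = new_class_id   # assumption: the annotation wins on overlap
    return combined

old_pred = np.array([[0, 1, 1],
                     [0, 1, 0],
                     [0, 0, 0]])   # 0 = background, 1 = a previously learned structure
new_mask = np.array([[0, 0, 1],
                     [0, 0, 1],
                     [0, 0, 0]])   # binary annotation of the new structure
print(combine_labels(old_pred, new_mask, new_class_id=2))
```

Every error in `old_pred` is copied into the training target unchanged, which is why this construction degrades when the old model is applied to shifted data.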
In this work, we construct a complementary pseudo-label $\hat{y}^t_n \in \mathbb{R}^{H \times W}$ with a MixUp decay scheme to adaptively exploit the knowledge in the old segmentor for the progressively learned new segmentor. In the initial training epochs, $f_{\theta^{t-1}}$ could be a more reliable supervision signal, while we would expect $f_{\theta^t}$ to gradually learn to perform better on predicting $Y^{t-1}$. Of note, even with the rigidity branch, the integrated network can be largely distracted by the plasticity branch in the initial epochs. Therefore, we propose to dynamically adjust their importance in constructing the pseudo-label along with the training progress. Specifically, we MixUp the predictions of $f_{\theta^{t-1}}$ and $f_{\theta^t}$ w.r.t. $Y^{t-1}$, i.e., $f_{\theta^t}(\cdot)[:t-1]$, and control their pixel-wise proportion in the pseudo-label $\hat{y}^t_{n:i}$ with MMD, where $i$ indexes each pixel and $\lambda$ is the adaptation momentum factor, which decays exponentially with the iteration $I$. $\lambda_0$ is the initial weight of $f_{\theta^{t-1}}(x^t_{n:i})$, which is empirically set to 1 to constrain $\lambda \in (0, 1]$. Therefore, the weight of the old model prediction is smoothly decreased along with the training, and $f_{\theta^t}(x^t_{n:i})$ gradually represents the target data for the old classes in $[:t-1]$. Of note, we have the ground truth of the new structure $s^t_{n:i}$ under HSI scenarios [5,33,30,18]. We calculate the cross-entropy loss $L_{CE}$ with the pseudo-label $\hat{y}^t_{n:i}$ as in self-training [15,38]. In addition to the old knowledge inherited in $f_{\theta^{t-1}}$, we propose to explore unsupervised learning protocols to stabilize the initial training. We adopt the widely used self-entropy (SE) minimization [7] as a simple add-on training objective. Specifically, we use the slice-level segmentation SE, which is the averaged entropy of the pixel-wise softmax prediction. In training, the overall optimization loss combines $L_{CE}$ with the SE term weighted by $\alpha$, which balances our HSI distillation and SE minimization terms; $I_{max}$ denotes the scheduled iteration. Of note, strictly minimizing the SE can result in a trivial solution of always predicting a one-hot distribution [7], so a linear decrease of $\alpha$ is usually applied, where $\lambda_0$ and $\alpha_0$ are reset in each stage.
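The training objective described above can be sketched end to end as follows. This is a hedged illustration, not the authors' code: the decay rate `rate`, the renormalization of the mixed old-class probabilities, the rule that annotated pixels become one-hot for the new class, and all function names are assumptions introduced for this example.

```python
import numpy as np

def mmd_lambda(iteration, lambda0=1.0, rate=1e-3):
    """Exponentially decaying weight of the old model (rate is a hypothetical knob)."""
    return lambda0 * np.exp(-rate * iteration)

def hsi_pseudo_label(p_old, p_new, s_new, lam):
    """Soft pseudo-label of shape (C_old + 1, H, W).

    p_old: (C_old, H, W) softmax of the previous model (old classes incl. background)
    p_new: (C_old + 1, H, W) softmax of the current model
    s_new: (H, W) binary ground truth of the newly annotated structure
    """
    c_old = p_old.shape[0]
    mixed = lam * p_old + (1.0 - lam) * p_new[:c_old]      # MixUp on the old classes
    mixed = mixed / mixed.sum(axis=0, keepdims=True)        # renormalize (assumption)
    label = np.concatenate([mixed, np.zeros_like(mixed[:1])], axis=0)
    new = s_new > 0
    label[:, new] = 0.0                                     # annotated pixels are trusted
    label[c_old, new] = 1.0
    return label

def cross_entropy(p, label, eps=1e-8):
    """Pixel-averaged cross-entropy between softmax output p and a soft label."""
    return float(-(label * np.log(p + eps)).sum(axis=0).mean())

def self_entropy(p, eps=1e-8):
    """Slice-level SE: entropy of the pixel-wise softmax, averaged over pixels."""
    return float(-(p * np.log(p + eps)).sum(axis=0).mean())

def alpha_schedule(iteration, max_iter, alpha0=10.0):
    """Linear decrease of the SE weight from alpha0 to 0 at max_iter."""
    return alpha0 * (max_iter - iteration) / max_iter

def total_loss(p_old, p_new, s_new, iteration, max_iter, alpha0=10.0):
    label = hsi_pseudo_label(p_old, p_new, s_new, mmd_lambda(iteration))
    alpha = alpha_schedule(iteration, max_iter, alpha0)
    return cross_entropy(p_new, label) + alpha * self_entropy(p_new)

def softmax(x):
    e = np.exp(x - x.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

rng = np.random.default_rng(0)
p_old = softmax(rng.normal(size=(2, 4, 4)))       # hypothetical old-model output
p_new = softmax(rng.normal(size=(3, 4, 4)))       # hypothetical new-model output
s_new = (rng.random((4, 4)) > 0.7).astype(int)    # hypothetical new-structure mask
for it in (0, 500, 1000):
    print(it, total_loss(p_old, p_new, s_new, it, max_iter=1000))
```

At iteration 0 the old classes are supervised entirely by the previous model; as $\lambda$ decays, the current model's own predictions take over, and the SE weight fades to zero so that the entropy term cannot drive the network toward a trivial one-hot output late in training.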

Experiments and Results
We carried out two evaluation settings using the BraTS2018 database [1], including cross-subset (relatively small domain shift) and cross-modality (relatively large domain shift) tasks. The BraTS2018 database is a continually evolving database [1] with a total of 285 glioblastoma or low-grade glioma subjects, comprising three consecutive subsets, i.e., 30 subjects from BraTS2013 [26], 167 subjects from TCIA [3], and 88 subjects from CBICA [1]. Notably, these three subsets were collected from different clinical sites, vendors, or populations [1]. Each subject has T1, T1ce, T2, and FLAIR MRI volumes with voxel-wise labels for the tumor core (CoreT), the enhancing tumor (EnhT), and the edema (ED). We incrementally learned the CoreT, EnhT, and ED structures throughout three consecutive stages, each following a different data distribution. We used a subject-independent 7/1/2 split for training, validation, and testing. For a fair comparison, we adopted the ResNet-based 2D nnU-Net backbone with BN as in [12] for all of the methods and all stages used in this work.
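A subject-independent split assigns whole subjects, rather than individual slices, to one partition, so that slices from the same subject never appear in both training and testing. The sketch below shows one way to obtain a 7/1/2 split; the function name `split_subjects`, the subject identifiers, the fixed seed, and the choice to split a single 30-subject subset are assumptions for illustration only.

```python
import random

def split_subjects(subject_ids, seed=0):
    """Shuffle subjects and split them 7/1/2 into train/val/test lists."""
    ids = sorted(subject_ids)
    random.Random(seed).shuffle(ids)        # reproducible shuffle
    n_train = len(ids) * 7 // 10
    n_val = len(ids) // 10
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

subjects = [f"subject_{k:03d}" for k in range(30)]   # hypothetical IDs for one subset
train, val, test = split_subjects(subjects)
print(len(train), len(val), len(test))
```

All 2D slices are then drawn per partition from the subjects assigned to it.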

Cross-subset structure incremental evolving
In our cross-subset setting, three structures were sequentially learned across three stages: (CoreT with BraTS2013) → (EnhT with TCIA) → (ED with CBICA). Of note, we used a CoreT segmentor trained with BraTS2013 as our off-the-shelf segmentor at $t = 0$. Testing involved all subsets and anatomical structures. We compared our framework with three typical structure-incremental (SI-only) segmentation methods, i.e., PLOP [5], MargExcIL [18], and UCD [30], which cannot address the heterogeneous data across stages. As shown in Table 1, PLOP [5] with additional feature statistic constraints has lower performance than MargExcIL [18], since the feature statistic consistency does not hold in HSI scenarios. Of note, the domain-incremental methods [17,32] cannot handle the changing output space. Our proposed HSI framework outperformed the SI-only methods [5,18,30] with respect to both DSC and HD by a large margin. For the anatomical structure CoreT learned at $t = 0$, the difference between our HSI and these SI-only methods was larger than 10% DSC, which indicates that data-shift-related forgetting leads to a more severe performance drop in the early stages. We set $\eta = 0.01$ and $\alpha_0 = 10$ according to the sensitivity study in the supplementary material.
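For readers less familiar with the overlap metric, the Dice similarity coefficient between a predicted mask $A$ and a reference mask $B$ is $2|A \cap B| / (|A| + |B|)$. A minimal sketch for binary masks follows; the function name `dice_score` and the convention of returning 1.0 when both masks are empty are assumptions, and the exact evaluation protocol of the paper may differ.

```python
import numpy as np

def dice_score(pred, target):
    """DSC between two binary masks; returns 1.0 when both masks are empty."""
    pred, target = np.asarray(pred).astype(bool), np.asarray(target).astype(bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 1.0
    return float(2.0 * np.logical_and(pred, target).sum() / denom)

pred = np.array([[1, 1], [0, 0]])     # hypothetical prediction for one structure
target = np.array([[1, 0], [0, 0]])   # hypothetical reference annotation
print(dice_score(pred, target))
```

In a multi-structure setting such as CoreT, EnhT, and ED, the score would be computed per structure on the corresponding binary masks.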
For the ablation study, we denote HSI-D$^3$F as our HSI without the D$^3$F module, i.e., simply fine-tuning the model parameters. HSI-cBRN used the dual-flow to avoid direct overwriting, but the model was not guided by cBRN for more generalized prediction on heterogeneous data. As shown in Table 1, both the dual-flow and cBRN improved the performance. Notably, the dual-flow model with flexible re-parameterization was able to alleviate the overwriting, while our cBRN was developed to deal with heterogeneous data. In addition, HSI-MMD indicates our HSI without the momentum MixUp decay in pseudo-label construction, i.e., simply regarding the prediction $f_{\theta^{t-1}}(x^t)$ as ground truth for $Y^{t-1}$. However, $f_{\theta^{t-1}}(x^t)$ can be quite noisy due to the low prediction quality for early-stage structures, which can be aggravated in long-term evolving scenarios. Of note, the pseudo-label construction is necessary, as in [5,18,30]. We also provide a qualitative comparison with SI-only methods and ablation studies in Fig. 3.

Cross-modality structure incremental evolving
In our cross-modality setting, three structures were sequentially learned across three stages: (CoreT with T1) → (EnhT with T2) → (ED with T2 FLAIR). Of note, we used the CoreT segmentor trained with the T1 modality as our off-the-shelf segmentor at $t = 0$. Testing involved all MRI modalities and all structures. Based on hyperparameter validation, we empirically set $\eta = 0.01$ and $\alpha_0 = 10$.
In Table 2, we provide quantitative evaluation results. Our HSI framework consistently outperformed the SI-only methods [5,18,30]. The improvement can be even larger than in the cross-subset task, since the input data are much more diverse in the cross-modality setting. Catastrophic forgetting can be severe when SI-only methods are used to predict early-stage structures, e.g., CoreT. We also provide the ablation study with respect to D$^3$F, cBRN, and MMD in Table 2. The inferior performance of HSI-D$^3$F/cBRN/MMD demonstrates the effectiveness of these modules for mitigating domain shifts.

Conclusion
This work proposed an HSI framework under a clinically meaningful scenario, in which clinical databases are sequentially constructed from different sites/imaging protocols with new labels. To alleviate catastrophic forgetting alongside continuously varying structures and data shifts, our HSI resorted to a D$^3$F module for flexibly learning and integrating old and new knowledge. In doing so, we were able to achieve divergence awareness with our cBRN-guided model adaptation for all the data involved. Our framework was optimized with a self-entropy regularized HSI pseudo-label distillation scheme with MMD to efficiently utilize the previous model on different types of MRI data. Our framework demonstrated superior segmentation performance in learning new anatomical structures from cross-subset/modality MRI data, with a large improvement over SI-only methods observed experimentally.

Fig. 2: Illustration of the proposed HSI pseudo-label distillation with MMD

Fig. 3: Segmentation examples in t = 1 and t = 2 in the cross-subset brain tumor HSI segmentation task.

Table 2: Numerical comparisons and ablation studies of the cross-modality brain tumor HSI segmentation task.